% !TEX TS-program = pdflatex
% !TEX encoding = UTF-8 Unicode

% This is a simple template for a LaTeX document using the "article" class.
% See "book", "report", "letter" for other types of document.

\documentclass[11pt]{report} % use larger type; default would be 10pt


\title{Distributed Tree Kernels\\\emph{Short Survival Manual}}

\author{Fabio Massimo Zanzotto, Lorenzo Dell'Arciprete}

\begin{document}

\maketitle
\newpage
\tableofcontents


\chapter{How to use the DTK as a batch process}

To use the package as a command-line tool that produces Distributed Trees, use the class \texttt{it.uniroma2.dtk.main.DTBuilder}.
You can execute:\\
\begin{center}
      java -cp target/DTK-[VersionNumber].jar it.uniroma2.dtk.main.DTBuilder
\end{center}
to obtain the current list of parameters:
\begin{itemize}
 \item \texttt{input <input file>}:     load trees (in Penn Treebank notation) from the given file
 \item \texttt{output <output file>}:   print distributed trees to the given file 
 \item \texttt{of [dsm|dbm]}: the format of the output file, either dense string matrix (dsm) or dense binary matrix (dbm); the default is dsm. The dense binary matrix is the dense matrix, binary format of SVDLIBC\footnote{\texttt{http://tedlab.mit.edu/\textasciitilde{}dr/SVDLIBC/}}.

 \item \texttt{randomSeed <seed>}:      use given random seed (default = 0)
 \item \texttt{vectorSize <size>}:      use given vector size (default = 4096)

\item \texttt{lambda <lambda>}:        use given lambda to weight tree fragments
                         (default = 1)

\item \texttt{op <operation class name>}:        the fully qualified name of the Java class realizing the desired vector composition operation; the following ready-to-use classes are available:
\begin{itemize} 
 \item \texttt{it.uniroma2.dtk.op.convolution.ShuffledCircularConvolution}
 \item \texttt{it.uniroma2.dtk.op.convolution.ShiftedCircularConvolution}
 \item \texttt{it.uniroma2.dtk.op.convolution.ReverseCircularConvolution}
 \item \texttt{it.uniroma2.dtk.op.product.ShuffledGammaProduct}
 \item \texttt{it.uniroma2.dtk.op.product.ShiftedGammaProduct}
 \item \texttt{it.uniroma2.dtk.op.product.ReverseGammaProduct}
\end{itemize} 
Default is \texttt{it.uniroma2.dtk.op.convolution.ShuffledCircularConvolution}.
Custom classes can be built by implementing the \texttt{it.uniroma2.dtk.op.IdealOperation} interface.
 


\item Additional settings:
\begin{itemize} 
 \item \texttt{not\_lexicalized}:           do not consider leaf nodes
 \item \texttt{pos}:                    use POS-augmented labels for leaf nodes in
                         (lexicalized) syntactic trees, e.g., \emph{sun::n}
\end{itemize} 

\item Using Weka as data converter:
\begin{itemize} 
 \item \texttt{weka}:           use a Weka input file and a Weka output converter (the default is \texttt{weka.core.converters.ArffSaver}, but another one can be specified). Each attribute whose name ends with \emph{:tree} is converted into a list of attributes representing the tree as a distributed tree
 \item \texttt{wekaconverter}: select the Weka output format by giving the fully qualified name of a \texttt{weka.core.converters.AbstractFileSaver} subclass. The savers available in Weka 3.7.7 are: ArffSaver, C45Saver, CSVSaver, JSONSaver, LibSVMSaver, MatlabSaver, SerializedInstancesSaver, SVMLightSaver, XRFFSaver.

\end{itemize} 


\end{itemize}
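
The default operation takes its name from circular convolution. For orientation only, the following is a minimal sketch of textbook circular convolution, $c_k = \sum_{i=0}^{n-1} a_i\, b_{(k-i) \bmod n}$, on two vectors of equal size $n$. The class \texttt{CircularConvolutionSketch} is a hypothetical illustration and is not part of DTK: it does not reproduce the shuffling, shifting or reversing applied by the classes listed above, and the library may well compute the result differently.

\begin{verbatim}
import java.util.Arrays;

// Hypothetical illustration, not part of DTK.
final class CircularConvolutionSketch {

    /** c[k] = sum over i of a[i] * b[(k - i) mod n]; O(n^2). */
    static double[] convolve(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                    "vectors must have equal length");
        }
        int n = a.length;
        double[] c = new double[n];
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < n; i++) {
                // floorMod keeps the index in [0, n) when k < i.
                c[k] += a[i] * b[Math.floorMod(k - i, n)];
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[] a = {0.0, 1.0, 0.0};
        double[] b = {1.0, 2.0, 3.0};
        System.out.println(Arrays.toString(convolve(a, b)));
    }
}
\end{verbatim}

With $a = (0, 1, 0)$ the result is $b$ rotated by one position, so the sketch prints \texttt{[3.0, 1.0, 2.0]}.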



For testing the software, run:
\begin{center}
	java -jar target/DTK-0.1.jar -input SampleInput.dat \\ -output SampleOutputNew.dat -vectorSize 4096
\end{center}
and check that \texttt{SampleOutputNew.dat} and \texttt{SampleOutput.dat} are identical.
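
The comparison can also be done programmatically. The following is a minimal sketch, assuming Java 12 or later (for \texttt{Files.mismatch}), that both files are in the current directory, and that a byte-for-byte check is the intended comparison; the class \texttt{OutputCheck} is a hypothetical helper, not part of DTK.

\begin{verbatim}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of DTK.
final class OutputCheck {

    /** Returns true when the two files are byte-for-byte equal. */
    static boolean identical(Path expected, Path actual) {
        try {
            // Files.mismatch returns -1 when no byte differs.
            return Files.mismatch(expected, actual) == -1L;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path expected = Path.of("SampleOutput.dat");
        Path actual = Path.of("SampleOutputNew.dat");
        System.out.println(identical(expected, actual)
                ? "outputs are identical"
                : "outputs differ");
    }
}
\end{verbatim}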

\section{Weka-standardized input and Weka converters for the output}

As input, you can use a Weka-standardized file. The standard features are augmented with features representing trees. This new attribute type is marked by the suffix \texttt{:tree} on the attribute name, e.g.:\\

\begin{tabular}{l}
\texttt{@attribute ...}\\
\texttt{@attribute first\_tree:tree string}\\
\texttt{@attribute ...}\\
\end{tabular}

\vspace{3em}

Trees in the data section are written in bracketed (Penn Treebank) notation, e.g.:\\

\begin{small}
\begin{tabular}{l}
\texttt{"(ROOT (SBARQ (WHNP (WP What))(SQ (VP (VBZ is)(ADJP (JJ e-coli))))(. ?)))"}
\end{tabular}
\end{small}

Each tree will be transformed into a vector of features \\
\begin{tabular}{l}
\texttt{@attribute first\_tree:tree\_0 numeric}\\
\texttt{@attribute first\_tree:tree\_1 numeric}\\
\texttt{@attribute ...}\\
\texttt{@attribute first\_tree:tree\_<N-1> numeric}\\
\end{tabular}\\
representing the distributed tree of the initial tree where \texttt{N} is the vector dimension.

You can write the generated distributed trees in any of the output formats provided by the Weka converters and use them with your favourite machine learning algorithm.
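
The naming scheme of the generated attributes can be sketched as follows. This is only an illustration of the convention described above, under the assumption that each new name is the original attribute name followed by an underscore and the component index; the class \texttt{TreeAttributeNames} is hypothetical and is not the code DTK uses.

\begin{verbatim}
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of DTK.
final class TreeAttributeNames {

    /**
     * Lists the numeric attribute declarations that replace one
     * ":tree" attribute: name_0, name_1, ..., name_<N-1>.
     */
    static List<String> expand(String treeAttribute,
                               int vectorSize) {
        if (vectorSize <= 0) {
            throw new IllegalArgumentException(
                    "vectorSize must be positive");
        }
        List<String> lines = new ArrayList<>(vectorSize);
        for (int i = 0; i < vectorSize; i++) {
            lines.add("@attribute " + treeAttribute
                    + "_" + i + " numeric");
        }
        return lines;
    }

    public static void main(String[] args) {
        // Prints the four declarations for a vector of size 4.
        expand("first_tree:tree", 4)
                .forEach(System.out::println);
    }
}
\end{verbatim}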




\chapter{How to generate Distributed Trees in your code}

First of all, initialize a distributed tree generator by instantiating the \texttt{it.uniroma2.dtk.dt.GenericDT} class with the parameters that suit your needs.
 
\begin{verbatim}
GenericDT generator = new GenericDT( int randomSeed, 
				         int vectorSize,
				         double lambda, 
				         IdealOperation opImplementation);
\end{verbatim}
where: 
\begin{itemize} 
\item \texttt{int randomSeed} is the random seed for the initialization of the random generator,
\item \texttt{int vectorSize} is the size of the reduced space,
\item \texttt{double lambda} is the lambda used to weight tree fragments,
\item \texttt{IdealOperation opImplementation} is an object implementing the basic operation for composing vectors.
\end{itemize} 

Some ready-to-use implementations of the \texttt{it.uniroma2.dtk.op.IdealOperation} interface are available in package \texttt{it.uniroma2.dtk.op}. These classes need to be initialized through the method \texttt{initialize(VectorProvider)} after instantiation. Any class that follows this instantiation/initialization procedure (i.e., a no-argument constructor followed by a call to \texttt{initialize(VectorProvider)}) can also be managed entirely by \texttt{GenericDT}, by invoking the constructor:

\begin{verbatim}
GenericDT generator = new GenericDT( int randomSeed, 
				         int vectorSize,
				         double lambda, 
				         Class<?> opImplementationClass);
\end{verbatim}

Once initialized:
\begin{itemize} 
\item
to obtain Distributed Trees use:
\begin{center}
\begin{verbatim}
double[] dt = generator.dt(Tree.fromPennTree(treeString));
\end{verbatim}
\end{center}
where \texttt{treeString} is the tree in Penn Treebank format and \texttt{dt} is the distributed tree of the specified dimension;
\item to obtain a Distributed Tree Fragment use:
\begin{center}
\begin{verbatim}
double[] dtf = generator.dtf(Tree.fromPennTree(treeString));
\end{verbatim}
\end{center}
where \texttt{treeString} is the tree in Penn Treebank format and \texttt{dtf} is the distributed tree fragment of the specified dimension.
\end{itemize} 
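
Both methods return plain \texttt{double[]} arrays, so the resulting vectors can be compared with ordinary vector arithmetic. The following is a minimal sketch, assuming that the dot product (or its normalized form, the cosine) is the similarity of interest; the class \texttt{VectorSimilarity} and its methods are hypothetical helpers, not part of DTK.

\begin{verbatim}
// Hypothetical helper, not part of DTK.
final class VectorSimilarity {

    /** Dot product of two vectors of equal length. */
    static double dot(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException(
                    "vectors must have equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    /** Cosine similarity; 0 if either vector is all zeros. */
    static double cosine(double[] x, double[] y) {
        double norm = Math.sqrt(dot(x, x) * dot(y, y));
        return norm == 0.0 ? 0.0 : dot(x, y) / norm;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {4.0, 5.0, 6.0};
        System.out.println(dot(x, y));
    }
}
\end{verbatim}

Given two trees \texttt{t1} and \texttt{t2}, a call such as \texttt{VectorSimilarity.dot(generator.dt(t1), generator.dt(t2))} would then compare their distributed trees.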





\end{document}